Skip to content

feat(consensus): persist standalone WAL with crash-safe replay#3763

Merged
ongyimeng merged 17 commits into
mainfrom
fix/tendermint-use-pebble-wal
Jul 2, 2026
Merged

feat(consensus): persist standalone WAL with crash-safe replay#3763
ongyimeng merged 17 commits into
mainfrom
fix/tendermint-use-pebble-wal

Conversation

@ongyimeng

@ongyimeng ongyimeng commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Summary

Replaces the consensus WAL, previously stored as generic CBOR-encoded values in the shared Pebble KV store, with a dedicated append-only WAL owned by consensus. The new WAL uses typed records, segment files, pruning, and crash-safe replay.

Why

Consensus relies on WAL replay to recover after a crash mid-height. As generic KV values, that recovery path was harder to inspect and debug, and was coupled to the shared KV store. A standalone WAL gives consensus an explicit durability boundary with clearer replay and failure handling.

Issue #3114

What changed

  • New consensus WAL:
    • typed record codec
    • append-only segments with rotation and tail repair
    • persisted prune watermarks
    • live-entry indexes
  • Replaced the Pebble KV-backed WAL and replay path.
  • Fixed Start entry replay. It already existed as an entry type, but the old generic CBOR decoding couldn't reconstruct it on replay.
  • Improved error handling across WAL writes/flushes, commit persistence, and replay.

Performance

Headline comparison is BenchmarkTendermintWALStoreSustainedHeightLatency, which is closest to the consensus write/delete flow: 1000 heights, real cleanup frequency, count=5.

Medians:

metric old WAL (KV) new WAL result
p50 155ms 153ms tie
p95 192ms 177ms ~7.8% faster
p99 239ms 203ms ~15.1% faster
max 298ms 299ms tie
B/op 1.77 MB 1.47 MB ~17% less
allocs/op 41.1k 28.2k ~31% fewer

Same latency at p50/max, better p95/p99, fewer allocations.

B/op and allocs/op are per 1000-height run, since Go counts the loop as one op under -benchtime=1x.

Note: This benchmark uses 1,000 heights, which fits within Pebble’s in-memory write path. It does not exercise the old KV-backed WAL’s eventual flush, compaction, or tombstone-reclamation costs. There is no steady-state latency regression at this scale, but it is not representative of recurring background costs in a long-running node which is opaque and state dependent.

Testing

  • Codec, replay, WAL store/writer, pruning, and failure-injection tests.
  • Restart test: node recovers via WAL replay with peers stopped and no committed block.
  • Benchmarks for entry durability, full-height writes, cleanup, replay, and sustained latency.

The remaining coverage gaps are mostly rare internal error paths. The main WAL recovery flows are already covered, so extra tests would add little value.

Comment thread consensus/db/wal_writer.go Fixed
Comment thread consensus/db/wal_writer.go Fixed
@codecov

codecov Bot commented Jun 24, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 76.45740% with 210 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.37%. Comparing base (81d1e20) to head (892cec6).

Files with missing lines Patch % Lines
consensus/walstore/codec.go 78.11% 37 Missing and 28 partials ⚠️
consensus/walstore/wal_store.go 76.54% 21 Missing and 17 partials ⚠️
consensus/walstore/replay.go 52.38% 20 Missing and 10 partials ⚠️
consensus/walstore/wal_writer.go 78.87% 17 Missing and 13 partials ⚠️
consensus/walstore/prune_watermark.go 58.33% 13 Missing and 7 partials ⚠️
consensus/driver/driver.go 60.00% 5 Missing and 1 partial ⚠️
consensus/walstore/wal_index.go 86.95% 3 Missing and 3 partials ⚠️
db/memory/db.go 60.00% 3 Missing and 3 partials ⚠️
consensus/driver/commit_listener.go 78.57% 3 Missing ⚠️
consensus/consensus.go 92.85% 1 Missing and 1 partial ⚠️
... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3763      +/-   ##
==========================================
+ Coverage   75.30%   75.37%   +0.06%     
==========================================
  Files         433      438       +5     
  Lines       38664    39463     +799     
==========================================
+ Hits        29116    29744     +628     
- Misses       7547     7647     +100     
- Partials     2001     2072      +71     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ongyimeng ongyimeng force-pushed the fix/tendermint-use-pebble-wal branch from e8c340a to 0383582 Compare June 25, 2026 04:39
@ongyimeng ongyimeng marked this pull request as ready for review June 25, 2026 05:29
@github-actions

github-actions Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Claude finished @ongyimeng's task in 6m 35s —— View job


PR Review: Standalone consensus WAL

  • Read core WAL files (codec, record, wal_writer, wal_store, wal_index, prune_watermark, replay)
  • Review consensus/driver integration & db layer changes
  • Analyze for correctness, concurrency, resource leaks, error handling
  • Post inline comments + summary

Reviewed the full change set with a focus on consensus-recovery correctness, durability ordering, concurrency, and resource handling. This is a well-engineered PR — the durability invariants are sound and the failure-injection test coverage is unusually thorough. I have no blockers.

Blocker

None.

Important

None. A few things I verified that are correct and worth recording:

  • WAL tail repair uses physical offsets consistently. appendSync stores the offset returned by Writer.WriteRecord and recoverLatestWALTail uses offset.Physical / PhysicalFile; the fake writer in tests mirrors record.Writer physical offsets, so truncation targets line up. (wal_writer.go:82, :230)
  • appendSync won't deadlock on the sync WaitGroup. On a WriteRecord error it returns before waitGroup.Wait(), so a never-signalled Done can't hang the flush. (wal_writer.go:67-76)
  • Prune durability ordering is crash-safe. Watermark is fsync'd before WAL files are deleted (wal_store.go:210-217), and a stale watermark is harmless because the surviving WAL prune records re-drive pruneLiveEntriesUpTo on replay (replay.go:113). Recovery at any crash point stays consistent.
  • Persisted chan error (buffer 1) is leak-free. Each sync.go path sends exactly once, and the buffer means a producer never blocks even when OnCommit returns early on ctx.Done() without draining. (sync.go:324/349/359/372, commit_listener.go:71)
  • putFixedUvarint32 round-trips correctly for the full uint32 range as a fixed 5-byte uvarint (verified the bit layout incl. the terminating buf[4] = value>>28). (codec.go:135)
  • WAL store is fully mutex-guarded; LoadAllEntries snapshots under lock; replay (isReplaying) correctly suppresses re-writing WAL entries, avoiding duplication.

Nit

  1. db/memory/db.go:113Path() panics on MkdirTemp failure and silently makes the "persistent" WAL ephemeral under an in-memory DB. Details in the inline comment. Fix this →
  2. wal_index.go:51panic(err) in updateIndexesFromCommittedRecords. These records were just built via setEntry, so entry() can't realistically fail, but a panic on the post-commit consensus path is a sharp edge. Consider surfacing it as a returned error from flushLocked instead.
  3. consensus.go:93-103 — WAL store leak on a future Init error. No path between NewTendermintWALStore and the return currently fails, but the store opens file handles / a Pebble WAL manager and is only closed via Driver.Run. If a fallible step is ever added below it, it'd leak; a deferred-close-on-error guard would harden this.
  4. wal_store.go:24 — WAL retention. With cleanupPruneRecordInterval = 256 and ~one prune/height, up to ~256 heights of sealed WAL accumulate between cleanups. Intentional and documented (amortizes fsyncs), just flagging the disk-footprint tradeoff for long-running nodes.

Test quality is good: channel-based synchronization (no time.Sleep), meaningful assertions, and the failure-injection suite covers the rare error paths the codecov gaps point at.
• Branch: fix/tendermint-use-pebble-wal

Comment thread db/memory/db.go
@ongyimeng ongyimeng requested a review from rodrodros June 25, 2026 05:56
@github-actions

github-actions Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Claude finished @ongyimeng's task in 2m 1s —— View job


Re-review: standalone consensus WAL (synchronize)

Re-reviewed the three commits pushed since the last review. They directly address the prior nits, and the changed packages build cleanly.

  • fix: handle WAL file close errors (a188a81)
  • fix(db/memory): avoid panic on temp path creation (ef5ed8f)
  • test: cover WAL record / prune watermark decode (0383582)
  • Check for regressions / new issues

Verdict

All three changes are correct. No blockers, no new issues.

repairWALTail close handling (wal_writer.go:181-209) — correctly converted to a named return (err error) with a deferred errors.Join on file.Close(). The earlier-return error paths (Stat/Truncate/Sync) are preserved and the close error is now joined rather than dropped. This resolves the CodeQL "writable file handle closed without error handling" finding. ✅

Path() no longer panics (db/memory/db.go:112-115) — returns "" on MkdirTemp failure instead of panic. Callers already treat "" as "no local path" (NewTendermintWALStore returns a clean error), so a transient TMPDIR failure now degrades gracefully instead of crashing the node from an infallible accessor. ✅

New decode-path tests (prune_watermark_test.go, codec_test.go, export_test.go) are meaningful — round-trip plus corruption cases (wrong size, bad header) assert real error behavior, using t.TempDir() with no flaky sleeps. ✅

Nit (non-blocking, optional)

  • db/memory/db.go Path() — the prior review also suggested a doc comment noting that under an in-memory backing DB the consensus WAL lives in an ephemeral temp dir that is RemoveAll'd on Close(), so there is no crash recovery in that configuration. The panic fix is the important part and is done; a one-line contract comment on Path() would still help future readers understand the silent ephemeral-WAL behavior. Add the comment →

Overall: the PR remains in good shape — durability invariants intact, the prior review's concrete items are now closed.

@rodrodros rodrodros left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments

Comment thread consensus/walstore/codec.go
Comment thread consensus/db/codec.go Outdated
Comment thread consensus/walstore/codec.go Outdated
Comment thread consensus/walstore/codec.go
Comment thread consensus/db/codec.go Outdated

@rodrodros rodrodros left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial review

Comment thread consensus/db/codec.go Outdated
Comment thread consensus/db/codec.go Outdated
Comment thread consensus/walstore/codec.go
Comment thread consensus/db/codec.go Outdated
Comment thread consensus/walstore/codec.go
Comment thread consensus/db/codec.go Outdated
Comment thread consensus/walstore/codec.go
Comment thread consensus/walstore/codec.go
Comment thread consensus/db/init_test.go Outdated
Comment thread sync/data_source.go
@github-actions

github-actions Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Claude finished @ongyimeng's task in 2m 27s —— View job


PR Re-review: standalone consensus WAL (synchronize)

Re-reviewed the commits pushed since the last review:

  • test: cover WAL record and prune watermark decode paths (0383582)
  • refactor: use constant for estimatedRecordBytes (daddcff)
  • refactor: pass WalRecordEnvelope by reference (1ad21e6)
  • Re-check core WAL files for correctness/concurrency
  • Verify the refactors introduce no aliasing/regression
  • Build affected packages

These directly address @rodrodros's review nits. Affected packages (consensus/db, db/memory) build cleanly.

Verdict

All changes are correct. No blockers, no new issues.

encodeBatch index-based header write (consensus/db/codec.go:52-78) — the previous append-based header construction was replaced with slices.Grow + explicit offset writes. I verified the byte layout is byte-for-byte identical to the original:

offset bytes field
0 (headerIndex) 1 InternalKeyKindSet
1 (keyLenOffset) 1 key length (4)
2 (keyOffset) 4 record key = uint32(i) (big-endian)
6 (valueLenOffset) 5 value length (fixed uvarint, backfilled)
11 (recordHeaderBytes) payload

slices.Grow(encodedBatch, recordHeaderBytes) guarantees cap ≥ headerIndex+11, so encodedBatch[:valueIndex] is safe, and all 11 reserved header bytes are unconditionally overwritten (kind/keyLen by assignment, key by PutUint32, value-length by putFixedUvarint32), so stale bytes from reused capacity can't leak. ✅

appendWALRecordPayload(payload, *walRecordEnvelope) (codec.go:83-85) — now takes the envelope by pointer to avoid copying the large struct. The function only reads record.* fields, never mutates or retains them, and &records[i] in the loop is a read-only borrow, so there's no aliasing hazard. The single test caller (export_test.go:31) was updated to pass &envelope. ✅

estimatedRecordBytes const (codec.go:40) — folding it into the const block with derived offset constants is purely mechanical and computes the same value (11 + 128 = 139). ✅

Path() returns "" on MkdirTemp failure (db/memory/db.go:114) — the fmt import was dropped along with the panic; matches the graceful-degradation behavior agreed in the prior review. ✅

The error-message wording flagged earlier (codec.go:176) is also now the clearer "trailing WAL record bytes: %d".

Overall: the PR remains in good shape. The durability invariants verified in the earlier reviews are untouched, and the refactors are behavior-preserving.
• Branch: fix/tendermint-use-pebble-wal

@github-actions

github-actions Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Claude finished @ongyimeng's task in 2m 44s —— View job


PR Re-review: standalone consensus WAL (synchronize)

Reviewed the commits pushed since the last review:

  • refactor: remove impossible WAL payload size check (2f5c590)
  • refactor: remove impossible WAL payload nil checks (f75177a)
  • refactor: pass WAL proposal value by reference (5884660)
  • fix: clarify remaining bytes error (afe4c34)
  • refactor: rename consensus db package to walstore (bd2d95d)
  • Verify build + check for regressions

These address @rodrodros's remaining review nits. consensus/... and db/... build cleanly, and there are no leftover references to the old consensus/db import path.

Verdict

All changes are correct and behavior-preserving. No blockers, no new issues.

Package rename dbwalstore (consensus/walstore/*) — purely mechanical: package clause + every importer (consensus.go, driver/driver.go, tests) updated consistently. No stale consensus/db references remain anywhere in the tree. ✅

Removed nil guards in appendWALRecordPayload (codec.go:89-103) — safe. Every envelope reaches the encoder via setEntry (record.go:39-80), which sets EntryKind and its matching pointer (ProposalEntry/PrevoteEntry/…) together and rejects a nil typed entry up front. So a given EntryKind branch can never observe a nil payload pointer; the guards were dead code. The proposal path also still has its own Value == nil check before dereferencing. ✅

Removed valueLen > math.MaxUint32 check (codec.go:71) — a per-record value is a fixed 4×uint64 array (≤ a few hundred bytes after framing), so it can never approach MaxUint32. The genuinely-meaningful bound, len(records) > math.MaxUint32, is retained at codec.go:27, so math/errors imports are still used and the build is intact. ✅

appendValue / valueToUint64Array now take *V (codec.go:325, :429) — the sole caller passes proposalMessage.Value (already a *V), which is guaranteed non-nil by the preceding if proposalMessage.Value == nil branch, so the new reflect.ValueOf(value).Elem() can't panic on a nil pointer. The helper only reads the array via reflection — no mutation or retention — so passing by pointer avoids the 4-word copy with no aliasing hazard. ✅

Error wording (codec.go:162) — now "unexpected remaining WAL record bytes: %d", matching the prior request for a clearer message. ✅

Overall: the PR remains in good shape. The durability/replay invariants verified in earlier reviews are untouched, and this batch is a clean set of nit-resolving refactors.
• Branch: fix/tendermint-use-pebble-wal

@rodrodros rodrodros left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Comment thread consensus/walstore/wal_index.go Outdated
Comment thread consensus/walstore/wal_index.go Outdated
@ongyimeng ongyimeng requested a review from RafaelGranza July 1, 2026 08:02
@ongyimeng

Copy link
Copy Markdown
Contributor Author

Please hold on for abit, I’m still doing some final checks and cleaning up a few things.

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Claude finished @ongyimeng's task in 2m 41s —— View job


PR Re-review: standalone consensus WAL (synchronize)

Reviewed the commits pushed since the last review:

  • refactor: combine duplicate WAL number checks (b824b41)
  • refactor: pass WAL uint64 arrays by reference (9d22903)
  • refactor: remove redundant WAL entry checks (d472a8e)
  • Verify build + check for regressions

These close out the last of @rodrodros's review nits. consensus/... builds cleanly.

Verdict

All changes are correct and behavior-preserving. No blockers, no new issues.

appendUint64Array now takes *T (codec.go:317) — avoids copying the 4-word array per call. Both call sites pass a guaranteed-non-nil address:

  • appendMessageHeader passes &header.Senderheader is a value parameter, so the field is addressable. ✅
  • appendVotePayload passes vote.ID only after the if vote.ID == nil { return … } guard at codec.go:301, so the pointer is never nil when dereferenced. ✅

The helper only reads (*value)[i] — no mutation or retention — so there's no aliasing hazard.

entry() drops the per-kind nil checks and returns a plain wal.Entry (panic on unknown kind) (record.go:82-97) — consistent with the other "impossible check" removals in this PR, and safe:

  • On the encode side, every envelope is built by setEntry (record.go:39-77), which sets EntryKind and its matching typed pointer together.
  • On the decode side (codec.go:178-214), the kind switch sets each typed pointer from decodeProposalRecord / decodeVotePayload / decodeTimeoutRecord, all of which return a non-nil pointer whenever err == nil (verified at codec.go:246, :266, :290). An unknown kind returns an error at codec.go:213 before any envelope with that kind escapes decode, so entry()'s default: panic is genuinely unreachable in practice.

So the two callers — replay.go:105 and wal_index.go:47 — can't observe a nil entry, and dropping the returned error (and the prior panic(err) in updateIndexesFromCommittedRecords) is correct. ✅

addIfMissing combined conditionals (wal_index.go:23) — s.first == walNum || slices.Contains(s.rest, walNum) is logically identical to the two separate ifs it replaced; short-circuit order is preserved. ✅

Overall: the PR remains in good shape. The durability/replay invariants verified in the earlier reviews are untouched, and this batch is a clean set of nit-resolving refactors.
• Branch: fix/tendermint-use-pebble-wal

Comment thread db/memory/db_test.go
Comment thread db/memory/db_test.go Outdated
Comment thread db/database.go
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Claude finished @ongyimeng's task in 1m 51s —— View job


PR Re-review: standalone consensus WAL (synchronize)

Reviewed the two commits pushed since the last review:

  • refactor: clarify WAL batch record indexing (595f11f)
  • test: assert memory DB copy path after copy (1afb6fd)
  • Check for regressions / new issues
  • Build affected packages

consensus/walstore and db/memory build cleanly.

Verdict

Both changes are correct and behavior-preserving. No blockers, no new issues.

encodeBatch variable rename + comment (consensus/walstore/codec.go:31-78) — purely cosmetic. keyByteskeySizeBytes and the headerIndex/valueLenIndex/valueIndex locals were renamed to recordStart/keyLenStart/keyStart/valueLenStart/payloadStart, and a layout comment ([kind:1][keyLen:1][key:4][valueLen:5][payload...]) was added. The computed offsets are byte-for-byte identical to before:

local value field
recordStart len(encodedBatch) start of record
keyLenStart +1 keyLen byte (= keySizeBytes = 4)
keyStart +2 4-byte record index key
valueLenStart +6 5-byte fixed uvarint value length
payloadStart +11 (recordHeaderBytes) payload

slices.Grow(…, recordHeaderBytes) still guarantees capacity for the 11-byte header, encodedBatch[:payloadStart] is safe, and putFixedUvarint32(encodedBatch[valueLenStart:payloadStart], …) still writes exactly the reserved 5 bytes. No logic change. ✅

TestMemoryDBCopyGetsIndependentPath (db/memory/db_test.go:30-36) — implements @RafaelGranza's nit: drops the pre-captured originalPath and asserts require.NotEqual(t, memoryDB.Path(), copyPath) at the point of comparison. This is a stronger check — Copy() builds a fresh Database with an empty path (db.go:230-243), and each Path() call lazily MkdirTemps an independent dir (db.go:112-118), so the two paths are guaranteed distinct. Reading memoryDB.Path() after the copy also guards against any future change where Copy() might carry the parent's path forward. ✅

Overall: the PR remains in good shape. The durability/replay invariants verified in earlier reviews are untouched, and this batch is a clean readability + test-hardening pass.

• Branch: fix/tendermint-use-pebble-wal

@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Claude finished @ongyimeng's task in 2m 32s —— View job


PR Re-review: standalone consensus WAL (synchronize)

Reviewed the change pushed since the last review:

  • Identify new commits since last review
  • Check the main merge for conflicts/regressions in consensus/walstore
  • Verify affected packages build
  • Report findings

The only new commit since the last review is the main merge (892cec6e). It brings main up to date but does not touch any WAL code — the walstore package, driver integration, and db layer changes are all untouched.

Verdict

No blockers, no new issues. The WAL durability/replay invariants verified across the earlier reviews are unchanged.

Merge adaptations — within the PR's scope, the merge only removed a now-deleted CandidateTxs field from three struct literals to match main:

file change
consensus/p2p/validator/empty_fixtures_test.go:130 drop CandidateTxs: []core.Transaction{}
consensus/p2p/validator/fixtures_test.go:325 drop CandidateTxs: []core.Transaction{}
sync/helpers.go:116 drop CandidateTxs: make([]core.Transaction, 0)

These are purely mechanical field removals tracking the upstream deletion of CandidateTxs, unrelated to the WAL change. ✅

Build./consensus/... ./sync/... ./db/... all compile cleanly after the merge. ✅

Overall: the merge is clean and the PR remains in good shape. Nothing further to flag.
• Branch: fix/tendermint-use-pebble-wal

@ongyimeng ongyimeng enabled auto-merge (squash) July 2, 2026 01:54
@ongyimeng ongyimeng disabled auto-merge July 2, 2026 01:56
@ongyimeng ongyimeng enabled auto-merge (squash) July 2, 2026 01:56
@ongyimeng ongyimeng merged commit 2095502 into main Jul 2, 2026
30 of 31 checks passed
@ongyimeng ongyimeng deleted the fix/tendermint-use-pebble-wal branch July 2, 2026 02:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants